Universitat de Barcelona 


In the current study, we evaluated various robust statistical methods for 
comparing two independent groups. Two simulation scenarios were 
generated: one with equal population means and another with different 
population means. In each scenario, 33 experimental conditions were used 
as a function of sample size, standard deviation and asymmetry. For each 
condition, 5000 replications per group were generated. The results obtained 
in this study show an adequate Type I error rate but not high power for the 
confidence intervals. In general, for the two scenarios studied (population 
mean differences and no population mean differences) and across the 
different conditions analysed, the Mann-Whitney U-test showed the best 
performance, followed closely by the Yuen-Welch t-test. 


In the social sciences, and particularly in psychology, many applied 
research studies use parametric statistical tests to evaluate their expectations 
or hypotheses. However, in most cases the adequacy of those tests is not 
assessed, and their use is often of dubious validity because the assumptions 
of the statistical test are violated. A clear example is the assumption of 
normality, which is often taken for granted even though observed 
distributions do not usually follow a normal distribution. In recent years, an 
increasing number of studies have paid attention to the assumptions of 
statistical tests; rather than using parametric tests indiscriminately, 
researchers use nonparametric tests or even perform 
logarithmic or power transformations on the original variables to obtain 
new distributions that follow the normal distribution. Some researchers use 
the Monte Carlo permutation test when they work with small samples and 
with variables that do not follow a normal distribution, but the resulting 
performance is not as good as desired (Holmes-Finch and Davenport, 
2009). On the other hand, some researchers defend the use of the Bayesian 
approach in statistical decision-making (De la Fuente, Canadas, Guardia, 
and Lozano, 2009), although this option is far from being widely applied in 
psychological research. 

However, it is necessary to mention that in recent years, in the field of 
psychology, there has been an effort to encourage researchers to provide 
more information than the p-value when they present the results of their 
research. Therefore, it is becoming advisable to provide, in addition to the 
test statistic and the p-value, other statistical measures, such as the effect 
size, confidence intervals around the statistic, and even 
confidence intervals for the effect size (APA, 2010; Bailar and Mosteller, 1988; 
Belia, Fidler, Williams and Cumming, 2005; Cohen, 1994; Cumming, 2008, 
2009; Cumming and Fidler, 2009; Cumming and Finch, 2001, 2005; 
Wilkinson and the Task Force on Statistical Inference, 1999; Wolfe and 
Hanley, 2002) or robust estimations of the confidence intervals of the robust 
effect size (Algina, Keselman and Penfield, 2005; Keselman, Algina, Lix, 
Wilcox and Deering, 2008). In fact, in 1934, J. Neyman proposed the use of 
confidence intervals in statistical decision-making (Cowles, 1989). 
Similarly, Hagen (1997) and Tryon (2001) indicated that confidence 
intervals provide the same information as the hypothesis test, and 
Coulson, Healey, Fidler and Cumming (2010), Cumming (2008, 2009), 
Cumming and Fidler (2011) and Cumming and Maillardet (2006) remarked 
that confidence intervals support inference without the need to 
formulate a null hypothesis. 

However, few works incorporate the use of the median confidence 
interval (Bonett and Price, 2002; Dubnicka, 2007; Lin, Newcombe, Lipsitz 
and Carter, 2009; Strelen, 2001, 2004; Wilcox, 2005; Woodruff, 1952). 
Bonett and Price (2002) indicate that inference based on the median is 
adequate when the distribution of the quantitative variable is skewed and 
leptokurtic and does not fit the normal distribution; 
in those situations the analysis based on the mean is not always robust. 
Several works (Pero, Delgado and Guardia, 2011; Pero, Guardia, Freixa and 
Turbany, 2008) show that the comparison of median confidence 
intervals performs well under a true H0 when 
comparing two independent groups, but poorly 
under a false H0. 

The main aim of this work is to study the benefit of confidence 
interval comparison for two independent groups when the distributions are 
asymmetrical. We compare the confidence intervals around the mean and 
the confidence intervals around the median to determine which analysis 
provides the best results in the comparison. For this comparison, we obtain 
the confidence intervals around the mean, the trimmed mean (Wilcox, 
2005) and the median. 

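The confidence interval for the trimmed mean was obtained by applying 
the following expression: 

$$\bar{X}_t \pm t_{1-\alpha/2}\, \frac{S_w}{(1 - 2\gamma)\sqrt{n}}$$


where $\bar{X}_t$ symbolizes the trimmed mean, $S_w$ the winsorized standard 
deviation, $\gamma$ the proportion of trimming in each tail and $t_{1-\alpha/2}$ 
the critical value of Student's t distribution. 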
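
The trimmed-mean interval can be illustrated with a minimal Python sketch. It is not the authors' R code. The function name `trimmed_mean_ci` is hypothetical, and the critical value `t_crit` is passed in by the caller rather than computed; in the usage line, 2.571 is the .975 quantile of Student's t with 5 degrees of freedom, assuming the n - 2g - 1 degrees of freedom used in Wilcox's formulation, where g is the number of observations trimmed per tail.

```python
import math

def trimmed_mean_ci(x, gamma=0.2, t_crit=2.0):
    """Return (lower, upper) for the gamma-trimmed mean of x.

    t_crit is the Student's t critical value supplied by the caller
    (assumption of this sketch: no t quantile is computed here).
    """
    xs = sorted(x)
    n = len(xs)
    g = math.floor(gamma * n)            # observations trimmed in each tail
    kept = xs[g:n - g]
    t_mean = sum(kept) / len(kept)       # trimmed mean
    # Winsorize: each trimmed value is replaced by the nearest retained value.
    wins = [kept[0]] * g + kept + [kept[-1]] * g
    w_mean = sum(wins) / n
    s_w = math.sqrt(sum((v - w_mean) ** 2 for v in wins) / (n - 1))
    half = t_crit * s_w / ((1 - 2 * gamma) * math.sqrt(n))
    return t_mean - half, t_mean + half

# Ten observations, 20% trimming, so g = 2 and 5 degrees of freedom.
lower, upper = trimmed_mean_ci(range(1, 11), gamma=0.2, t_crit=2.571)
```

The factor (1 - 2γ) in the denominator rescales the winsorized standard deviation so that it estimates the standard error of the trimmed mean rather than of the ordinary mean.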

In the case of the median, we obtain the confidence interval using five 
different methods: the standard error method (Kendall, 1945; Mothes and 
Torrens-Ibern, 1970), the binomial median confidence interval (Bland, 2003; 
DeCoster and Burchill, 2000), the McKean and Schraeder method, the Maritz 
and Jarret method and the adaptive kernel estimation to obtain the median 
confidence interval (Wilcox, 2005). 

According to Kendall (1945) and Mothes and Torrens-Ibern (1970), if 
the population is normal with mean μ and standard deviation σ and the 
sample is sufficiently large, the sampling distribution of the median tends 
to a normal distribution with the following characteristics: 


$$E[\text{median}] = \mu, \qquad VAR[\text{median}] = \frac{\pi\sigma^{2}}{2n}, \qquad SE(\text{median}) = 1.253\,\frac{\sigma}{\sqrt{n}}$$

Consequently, the interval is obtained by calculating the following 
formula: 


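$$Md \pm z_{1-\alpha/2} \cdot 1.253\,\frac{\sigma}{\sqrt{n}}$$

where $z_{1-\alpha/2}$ is the standard normal critical value. 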

In the case of small samples (which are common in the social 
sciences), the sampling distribution of the median is not known, even 
though authors such as Lane (1999) posit that the sampling distribution of 
the median is normal if the variable in the original population is distributed 
normally. 
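
As a rough illustration of the standard error method, the following Python sketch builds the normal-theory interval around the sample median. It is not the authors' code. Two choices are assumptions: the sample standard deviation stands in for σ, and a standard normal critical value is used. The function name `median_ci_normal` is hypothetical.

```python
import math
from statistics import NormalDist, median, stdev

def median_ci_normal(x, conf=0.95):
    """Normal-theory CI for the median: Md +/- z * 1.253 * s / sqrt(n)."""
    n = len(x)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # e.g. about 1.96 for 95%
    se = 1.253 * stdev(x) / math.sqrt(n)           # s estimates sigma (assumption)
    md = median(x)
    return md - z * se, md + z * se

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
lower, upper = median_ci_normal(sample)
```

The constant 1.253 is the square root of π/2, which is why this standard error is about 25% larger than that of the mean under normality.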

The second method (the calculation of the median confidence interval 
from the binomial distribution) is based on the positions of the lower limit 
and the upper limit of the interval obtained from the binomial 
distribution, given that the number of observations below the percentile k 
follows this distribution with parameters n and k (Bland, 2003; DeCoster 
and Burchill, 2000) and the median is the central point of the distribution, k = .5. 
It is defined by the following parameters: 

Position occupied by the value of the median: $\frac{n+1}{2}$

Standard error: $\sqrt{n\,p\,(1-p)}$, where p = .5 

Consequently, the interval for the positions is obtained by calculating 
the following formula: 

$$\frac{n+1}{2} \pm z_{1-\alpha/2} \sqrt{n\,p\,(1-p)}$$

Then, the positions are rounded off to the next whole number, and 
finally, the values of the observed distribution that occupy these positions 
are obtained. 
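
The binomial procedure can be sketched in Python as follows. It is not the authors' code. The rounding rule is an assumption of this sketch: positions are rounded outward (lower limit down, upper limit up) and clipped to the range 1..n, which is one possible reading of "rounded off to the next whole number". The function name `median_ci_binomial` is hypothetical.

```python
import math
from statistics import NormalDist

def median_ci_binomial(x, conf=0.95, p=0.5):
    """CI for the median from the positions (n + 1)/2 +/- z * sqrt(n p (1 - p))."""
    xs = sorted(x)
    n = len(xs)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    centre = (n + 1) / 2                       # position of the median
    half = z * math.sqrt(n * p * (1 - p))      # z times the standard error
    # Assumed rounding: outward, then clipped to valid positions 1..n.
    lo_pos = max(1, math.floor(centre - half))
    hi_pos = min(n, math.ceil(centre + half))
    # Positions are 1-based, Python lists are 0-based.
    return xs[lo_pos - 1], xs[hi_pos - 1]

lower, upper = median_ci_binomial(range(1, 11))
```

Because the limits are observed values, the interval depends only on the ranks of the data and makes no assumption about the shape of the distribution.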


The equation proposed by McKean and Schraeder (1984) for 
obtaining the median’s confidence interval from the estimation of the 
median standard error is (Wilcox, 2005): 

$$SE = \frac{x_{(n-k+1)} - x_{(k)}}{2\,z_{.995}}, \quad \text{where } k = \frac{n+1}{2} - z_{.995}\sqrt{\frac{n}{4}}$$
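
A minimal Python sketch of the McKean and Schraeder standard error follows. It is not the authors' code. Two details are assumptions not stated in the text: k is rounded to the nearest integer and kept at 1 or above, and the interval is formed as median plus or minus a standard normal critical value times the standard error. Function names are hypothetical.

```python
import math
from statistics import NormalDist, median

def mckean_schraeder_se(x):
    """Standard error of the median from two order statistics."""
    xs = sorted(x)
    n = len(xs)
    z995 = NormalDist().inv_cdf(0.995)
    # Assumption: k is rounded to the nearest integer and kept >= 1.
    k = max(1, math.floor((n + 1) / 2 - z995 * math.sqrt(n / 4) + 0.5))
    # x_(n-k+1) and x_(k) in 1-based notation.
    return (xs[n - k] - xs[k - 1]) / (2 * z995)

def median_ci_mckean_schraeder(x, conf=0.95):
    """Assumed interval form: Md +/- z * SE."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = mckean_schraeder_se(x)
    md = median(x)
    return md - z * se, md + z * se

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
se = mckean_schraeder_se(sample)
lower, upper = median_ci_mckean_schraeder(sample)
```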

Maritz and Jarret’s (1978) equation for obtaining the median’s 
confidence interval from the estimation of the proposed standard error is 
(Wilcox, 2005): 



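$$\left(\lambda\,x_{(k+1)} + (1-\lambda)\,x_{(k)},\;\; \lambda\,x_{(n-k)} + (1-\lambda)\,x_{(n-k+1)}\right)$$


Where: 


$$I = \frac{\gamma_k - (1-\alpha)}{\gamma_k - \gamma_{k+1}}$$


$$\lambda = \frac{(n-k)\,I}{k + (n-2k)\,I}$$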
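
The interpolated limits can be sketched in Python as follows. It is not the authors' code. The text does not define γ_k, so this sketch assumes it is the exact binomial coverage probability of the interval (x_(k), x_(n-k+1)) for the median, and that k is the largest value whose coverage is still at least 1 - α. Function names are hypothetical, and the sample is assumed large enough that such a k exists.

```python
import math

def binom_cdf(j, n):
    """P(Y <= j) for Y ~ Binomial(n, 0.5)."""
    return sum(math.comb(n, i) for i in range(j + 1)) / 2 ** n

def coverage(k, n):
    """Assumed meaning of gamma_k: exact coverage of (x_(k), x_(n-k+1))."""
    return 1 - 2 * binom_cdf(k - 1, n)

def interpolated_median_ci(x, alpha=0.05):
    xs = sorted(x)
    n = len(xs)
    # Largest k whose coverage is still at least 1 - alpha.
    k = 1
    while k + 1 <= n // 2 and coverage(k + 1, n) >= 1 - alpha:
        k += 1
    g_k, g_k1 = coverage(k, n), coverage(k + 1, n)
    i = (g_k - (1 - alpha)) / (g_k - g_k1)
    lam = (n - k) * i / (k + (n - 2 * k) * i)
    # Order statistics are 1-based in the formula, 0-based in the list.
    lower = lam * xs[k] + (1 - lam) * xs[k - 1]
    upper = lam * xs[n - k - 1] + (1 - lam) * xs[n - k]
    return lower, upper

lower, upper = interpolated_median_ci(range(1, 11))
```

The weight λ moves each limit between two adjacent order statistics, so the interval targets the nominal coverage instead of jumping between the discrete coverage levels available from order statistics alone.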


Finally, the adaptive kernel estimation consists of computing a 
confidence interval for the median using an estimate of the standard error 
based on an adaptive kernel density estimator, which gives a good 
approximation of the true density with small samples (Wilcox, 2005). 


METHOD 

Procedure. The data analysed in this study were generated with 
R software (R Development Core Team, 2010) under the assumption 
of a two independent groups design. In particular, two possible scenarios 
were simulated: one in which the population means were equal (100) and 
one with different population means (100 and 115, respectively). Both 
scenarios were simulated with equal population variances in the two groups 
(100). For each scenario, 9 possible experimental conditions were studied 
with varying sample sizes (n = 10, n = 30 or n = 50, with both groups 
having the same sample size) and asymmetrically 
generated distributions (g_x = .0, g_x = .8, g_y = .0 or g_y = .8; the subscript “x” 
refers to one group, and the subscript “y” refers to the other group; for 
equal means the condition g_x = .8, g_y = .0 was not studied). In addition, for both 
scenarios, another 24 experimental conditions were studied, with different 
sample sizes (n_x = 10, n_y = 30 and n_x = 30, n_y = 10), equal or different 
variances (σ_x = σ_y = 10; σ_x = 5, σ_y = 10; and σ_x = 10, σ_y = 5) and 
asymmetrically generated distributions (g_x = .0, g_x = .8, g_y = .0 or g_y = .8). 

In the simulation of each condition, 5000 replications for each group 
were generated according to the standard normal distribution 
[R code: x[i, ] <- sort(rnorm(10)); y[i, ] <- sort(rnorm(10)); the value 
10 is replaced by 30 or 50 to simulate the 
other sample sizes]. In a second step, the asymmetry in the 
distribution was generated by applying the formula of the g-and-h distribution 
(Field and Genton, 2006; Hoaglin, 1985): 



$$X = \frac{e^{gZ} - 1}{g}\; e^{hZ^{2}/2}$$


where Z is a standard normal variable, g indicates the asymmetry that can be generated, and h indicates the 
kurtosis that can be generated in the normal distribution. In both cases, a 0 
would indicate a perfectly symmetrical and mesokurtic distribution (for all 
the simulations h was fixed to 0). As the parameters move away from 0, the 
asymmetry increases and the kurtosis of the curve increases or decreases 
(Field and Genton, 2006; Hoaglin, 1985 or Wilcox, 2005). Asymmetry was 
generated in the two comparison groups (the “x” group and the “y” group 
for both scenarios). Finally, the third step in the simulation consisted of 
multiplying the distribution by the value of standard deviation and adding 
the mean to obtain a distribution with a mean of 100 or 115 and an original 
standard deviation of 10 or 5. 
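
The three simulation steps (standard normal draw, g-and-h transformation, rescaling) can be sketched in Python. It is not the authors' R code. The function names, the use of Python's `random` module and the seed are all assumptions of this illustration; for g = 0 the transformation is taken as Z·exp(hZ²/2), the limiting form of the g-and-h expression.

```python
import math
import random

def gh_transform(z, g=0.0, h=0.0):
    """g-and-h transformation of a standard normal value z."""
    core = z if g == 0 else (math.exp(g * z) - 1) / g
    return core * math.exp(h * z * z / 2)

def simulate_group(n, mean, sd, g=0.0, h=0.0, rng=random):
    """Step 1: normal draws; step 2: g-and-h; step 3: multiply by sd, add mean."""
    values = (mean + sd * gh_transform(rng.gauss(0, 1), g, h) for _ in range(n))
    return sorted(values)   # sorted, mirroring sort(rnorm(...)) in the text

rng = random.Random(2010)   # hypothetical seed, for reproducibility only
x = simulate_group(10, mean=100, sd=10, g=0.0, rng=rng)
y = simulate_group(10, mean=115, sd=10, g=0.8, rng=rng)
```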

Tables 1 and 2 show some descriptive indicators of the 
simulated distributions for all the experimental conditions generated. 

For each simulation, we computed Student’s t-test for independent 
groups (t-test), Yuen-Welch’s t-test (Wilcox, 2005), the nonparametric 
Mann-Whitney U-test, the mean confidence interval, the trimmed mean 
confidence interval, the median confidence interval from the standard error 
(Kendall, 1945), the binomial median confidence interval, and the median 
confidence intervals based on the standard error from the McKean and 
Schraeder method (1984), the Maritz and Jarret method (1978) and the 
adaptive-kernel density estimation (Wilcox, 2005). 

mean generated in group “y”; σ_y: population standard deviation generated in group “y”; g_y: asymmetry generated in group “y”; Md: median; T mean: 25% 
trimmed mean; SD: standard deviation; As: asymmetry. 

Data analysis. The data analysis consisted of obtaining the 
percentage of errors in the statistical decision for the 5000 comparisons in 
the 33 conditions of both scenarios, using the t-test for 
independent groups (according to the homoscedasticity condition), the 
nonparametric Mann-Whitney U-test, Yuen-Welch’s t-test (Wilcox, 2005) 
and the comparison of the confidence intervals of means and medians 
obtained with the seven procedures used. The significance level was set at 
an alpha of 5% for all hypothesis tests. The statistical decisions based on 
the comparison of confidence 
intervals were made according to two criteria: one non-strict and one strict. 
According to the first criterion, the null hypothesis was rejected if the 
statistic (mean or median) of the first group (“x”) was not within the 
confidence interval generated for the second group (“y”), and the statistic of 
the second group was not within the confidence interval generated for the 
first group. According to the second criterion, the null hypothesis was 
rejected when the two confidence intervals did not overlap. 
Finally, for each condition and each scenario, the percentage of replications 
with an incorrect decision was obtained. 
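
The two decision criteria and the resulting error percentage can be expressed in a few lines of Python. It is not the authors' code. Function names and the example numbers are hypothetical; intervals are represented as (lower, upper) tuples.

```python
def reject_non_strict(stat_x, ci_x, stat_y, ci_y):
    """Reject H0 if each group's statistic lies outside the other group's CI."""
    x_outside = not (ci_y[0] <= stat_x <= ci_y[1])
    y_outside = not (ci_x[0] <= stat_y <= ci_x[1])
    return x_outside and y_outside

def reject_strict(ci_x, ci_y):
    """Reject H0 only if the two confidence intervals do not overlap."""
    return ci_x[1] < ci_y[0] or ci_y[1] < ci_x[0]

def error_rate(decisions, null_is_true):
    """Percentage of wrong decisions; each decision is True when H0 is rejected."""
    # Rejecting a true H0 is an error; so is retaining a false H0.
    wrong = sum(d == null_is_true for d in decisions)
    return 100 * wrong / len(decisions)

# Hypothetical example: the intervals overlap slightly, but each statistic
# falls outside the other group's interval.
ci_x, ci_y = (95.0, 105.0), (103.0, 113.0)
non_strict = reject_non_strict(100.0, ci_x, 108.0, ci_y)   # True
strict = reject_strict(ci_x, ci_y)                          # False
```

The example shows why the strict criterion is the more conservative of the two: any pair of intervals that fails to overlap also satisfies the non-strict rule, but not the other way round.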


RESULTS 

We show the results obtained from the generated simulations in 
Tables 3 to 6. In Tables 3 and 4 we show the results for the scenario of no 
mean population differences, and in Tables 5 and 6 we show the results for 
the scenario of mean population differences. 

We can see in Table 3 that when we work with sample sizes of 30 or 
50 subjects in each group and generate asymmetry in the “y” group, classical 
statistical tests, such as the t-test for independent groups and the 
nonparametric Mann-Whitney U-test, have a high error rate in their 
decisions. In fact, this error is greater than 20% in the case of the t-test and 
over 10% in the case of the nonparametric Mann-Whitney U-test. It should 
be noted that in the case of Yuen-Welch’s t-test, the error rate is around the 
nominal alpha of 5%. If we look at the decision from the comparison of 
confidence intervals following the non-strict decision criterion (mean or 
median of each group not inside the confidence interval computed for the 
other group), the strategy that yields the fewest incorrect decisions is the 
comparison of binomial median confidence intervals. In the case 
of median confidence intervals computed from McKean and Schraeder’s 
estimation, Maritz and Jarret’s estimation or adaptive kernel estimation, it is 
necessary to note that when the sample size increases, the percentage of 
incorrect decisions also increases, and this percentage is approximately 10% 
when we work with 50 subjects in each group. Finally, we note 
that if the decision is based on the comparison of 
confidence intervals for the mean or the trimmed mean, the 
percentage of incorrect decisions is higher than when comparing median 
confidence intervals, regardless of the method used to estimate the median 
confidence interval. 

When we use the strict decision criterion for comparing the two 
confidence intervals (non-overlapping confidence intervals), if the decision 
is based on the comparison of median confidence intervals, regardless of 
the strategy used to compute these confidence intervals, or if the decision is 
based on the comparison of trimmed mean confidence intervals, the rate of 
incorrect decisions is, in general, less than 1% for the 9 conditions studied. 
It is important to note that when we compare the mean confidence intervals, 
the error rate is approximately 10% for the condition of asymmetry in both 
groups and sample size of 50 subjects; for the other conditions, the error 
rate is less than the nominal alpha value of 5%. 

We can see in Table 4 that the best performance is shown by the 
Yuen-Welch t-test and the Mann-Whitney U-test. In the latter case, for the 
group with the small sample size (n = 10) and greater standard deviation 
(σ = 10), the error rate is around 10%. With regard to the confidence 
intervals, performance is in general good, except for the mean confidence interval. 

In the scenario of mean population differences (Tables 5 and 6), the 
results are not as easy to summarize as in the scenario of no mean 
population differences. In general, the best performance is shown by the 
Mann-Whitney U-test and the Yuen-Welch t-test. However, 
their performance is not good for small sample sizes (n = 10) 
with a standard deviation of 10 and asymmetry in the “x” 
group. With regard to the comparison of confidence intervals, their 
performance is in general poor and irregular (in some conditions they 
perform well, for example using the non-strict criterion when n_x = 30 and 
n_y = 10). 

intervals; EE Md CI: median confidence intervals according to the standard error; Binomial Md CI: median confidence intervals according to the binomial 
distribution; Mks Md CI: median confidence intervals by McKean and Schraeder’s estimation; MJ Md CI: median confidence intervals by Maritz and Jarret’s 
estimation; and k Md CI: median confidence intervals by the adaptive-kernel estimation. 

DISCUSSION 

The first conclusion that we obtain from this work is that the 
comparison of confidence intervals based on the median has a low type I 
error rate, regardless of the method used to compute the confidence interval, 
but this method’s power is not as good as expected. In fact, this result has 
also been obtained by Pero et al. (2011) and Pero et al. (2008). Also, these 
results are in line with the results obtained by Holmes-Finch and Davenport 
(2009) in the use of a Monte Carlo permutation for small samples when the 
dependent variable does not follow a normal distribution in a MANOVA 
design. 

Although we noted previously that the decision from the comparison 
of median confidence intervals does not exhibit the expected results, it is 
important to note that when there are no mean population differences, their 
performance is good, and in the scenario of mean population differences 
when working with large sample sizes (more than 30 subjects in each 
group), the decision based on the comparison of median confidence 
intervals from the binomial distribution, McKean and Schraeder estimation, 
Maritz and Jarret estimation and adaptive-kernel estimation has good 
power if the decision criterion is non-strict. 

If we had to recommend a statistical decision strategy for comparing 
two independent groups, based on our results, the best strategies are the 
Mann-Whitney U-test and the Yuen-Welch t-test. The latter 
test is easily computed in R [R code: 
yuen(x, y, tr = 0.2, alpha = 0.05)]. This result is coherent because 
the Mann-Whitney U-test is a nonparametric test suited to 
asymmetrical distributions and the Yuen-Welch t-test is 
based on robust and resistant measures, namely, the trimmed mean and the 
winsorized variance (Wilcox, 2005). 
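
To show how the trimmed mean and the winsorized variance enter the Yuen-Welch statistic, here is a minimal Python sketch. It is not the R function `yuen` quoted in the text, and the function names are hypothetical. It follows the usual formulation in which d_j = (n_j - 1)s²_wj / (h_j(h_j - 1)), with h_j the number of observations left after trimming, and it returns only the statistic and its degrees of freedom; obtaining a p-value would require a Student's t distribution function, which is left out.

```python
import math

def _trim_stats(x, tr):
    """Trimmed mean, squared standard error term d, and retained count h."""
    xs = sorted(x)
    n = len(xs)
    g = math.floor(tr * n)               # observations trimmed in each tail
    kept = xs[g:n - g]
    h = len(kept)
    t_mean = sum(kept) / h
    wins = [kept[0]] * g + kept + [kept[-1]] * g
    w_mean = sum(wins) / n
    s2_w = sum((v - w_mean) ** 2 for v in wins) / (n - 1)   # winsorized variance
    d = (n - 1) * s2_w / (h * (h - 1))
    return t_mean, d, h

def yuen_statistic(x, y, tr=0.2):
    """Return (T, df) for the Yuen-Welch comparison of two trimmed means."""
    mx, dx, hx = _trim_stats(x, tr)
    my, dy, hy = _trim_stats(y, tr)
    t = (mx - my) / math.sqrt(dx + dy)
    df = (dx + dy) ** 2 / (dx ** 2 / (hx - 1) + dy ** 2 / (hy - 1))
    return t, df

t, df = yuen_statistic(range(1, 11), range(11, 21), tr=0.2)
```

As in Welch's test, the degrees of freedom are estimated from the data, so the two groups are not assumed to share a common variance.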

Complementarily, it is important to point out the bias effect that some 
coefficients of symmetry could generate in the data simulated when we 
generate complex distributions. Our results show that in very small samples 
this effect could generate some relevant distortion. 

Our results lead us to reflect on statistical decision-making. In this 
work, we present empirical evidence concerning the incorrect performance 
of the t-test. However, this test is not the only statistical test that could 
perform poorly when the conditions of application are violated 
(other examples are ANOVA and the chi-square test), and there are few works 
that present empirical evidence about their performance (e.g., Holmes-Finch 
and Davenport, 2009). We should rethink the use of traditional statistical 
tests in applied research; these tests are adequate when all of the 
assumptions are satisfied, but the fulfilment of these assumptions is not the 
norm in applied research; for example, the random sampling of the units 
studied is often not satisfied. 

It is possible that in these decisions, we can act as Homo Heuristicus 
and use only the useful information to make a decision (Gigerenzer, 1991; 
Gigerenzer and Brighton, 2009; and Goldstein and Gigerenzer, 2002). 
Moreover, it is necessary to comment that in this work, we consider the use 
of confidence intervals based on the mean or the median to be similar to 
null hypothesis significance testing (NHST), and this approach may be 
complemented, as Coulson et al. (2010) and Cumming and Fidler (2011) 
propose. As those authors note, the right option may be to interpret 
the confidence intervals not merely in terms of whether they contain a 
difference of 0, but in terms of their precision and their graphical representation. 


SUMMARY 

Study of the adequacy of different robust statistical tests 
for comparing two independent groups. In the present 
study, different robust statistical tests for 
comparing two independent groups are evaluated. Two simulation scenarios 
were generated: one for equality of population means and another for 
inequality. For each scenario, 33 experimental conditions were used 
by manipulating the values of sample size, standard deviation 
and asymmetry. For each condition, 5000 
replications per group were generated. The results obtained show an 
adequate Type I error rate, but the power associated with the confidence 
intervals is not adequate. In general, for the two scenarios studied 
(differences and no differences in population means) under the different 
conditions analysed, the Mann-Whitney U-test shows the 
best performance, with the Yuen-Welch t-test slightly worse. 